May 21 2025
In recent years, sound classification has become a valuable tool in real-time monitoring systems, particularly for public safety and smart surveillance. One powerful model for sound classification is YAMNet, developed by Google. In this blog post, we will walk you through how we used YAMNet along with a custom-trained dataset to build an emergency sound detection system capable of identifying critical sounds like gunshots, glass breaks, and animal attacks.
YAMNet (Yet Another Mobile Network) is a pre-trained deep learning model from Google that classifies audio into 521 sound event classes drawn from the AudioSet ontology. It uses the MobileNet v1 architecture, which is lightweight yet efficient, making it well suited to real-time audio classification tasks.
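To get a feel for what YAMNet produces, the model can be loaded straight from TensorFlow Hub and run on a raw waveform; it returns per-frame class scores over the AudioSet ontology along with the embeddings we rely on later. A minimal sketch (the one second of silence is just a placeholder input):

import numpy as np
import tensorflow_hub as hub

# Load YAMNet from TensorFlow Hub (downloaded on first use)
yamnet = hub.load('https://tfhub.dev/google/yamnet/1')

# Placeholder input: one second of silence at 16 kHz
waveform = np.zeros(16000, dtype=np.float32)

# scores: (frames, 521) class scores; embeddings: (frames, 1024)
scores, embeddings, spectrogram = yamnet(waveform)
print(scores.shape, embeddings.shape)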
We built an emergency sound detection system that works in real time and can identify high-risk sounds such as:
Gunshots
Glass breaking
Dog barks and howls (animal attacks)
Here’s the basic workflow we followed:
Dataset Preparation: We collected a varied dataset of emergency sounds (gunshots, glass breaking, howls, etc.) from multiple sources.
Feature Extraction using YAMNet: We used YAMNet to extract embeddings (1024-dimensional feature vectors) from each audio file in the dataset. These embeddings represent the key characteristics of each sound.
Training a Custom Classifier: We trained a machine learning model (such as Random Forest or MLP) using the extracted embeddings to classify emergency sounds into specific categories.
Real-Time Detection: We implemented a live microphone input system that listens continuously, processes short audio chunks with YAMNet, classifies them with the trained model, and raises an alert when a critical sound is detected.
Implementation Overview
After designing the workflow, we implemented the project using Python and TensorFlow. The implementation was divided into three main parts:
1. Audio Feature Extraction using YAMNet
We used the pre-trained YAMNet model to extract embeddings from our emergency audio dataset. These embeddings are high-level features that represent each sound in a compact form.
# Example: Extract embeddings from an audio file
import numpy as np
import soundfile as sf
import params as yamnet_params  # from the tensorflow/models yamnet repo
import yamnet as yamnet_model

# Load the YAMNet model and its pre-trained weights
yamnet = yamnet_model.yamnet_frames_model(yamnet_params.Params())
yamnet.load_weights('yamnet.h5')

# Load the audio as mono float32
waveform, sr = sf.read('sample_gunshot.wav', dtype='float32')
if waveform.ndim > 1:
    waveform = waveform.mean(axis=1)

# YAMNet expects 16 kHz audio; resample with linear interpolation if needed
if sr != 16000:
    duration = len(waveform) / sr
    waveform = np.interp(np.linspace(0.0, duration, int(duration * 16000)),
                         np.linspace(0.0, duration, len(waveform)),
                         waveform).astype(np.float32)

# Extract embeddings: one 1024-dim vector per 0.96 s window (0.48 s hop)
scores, embeddings, spectrogram = yamnet(waveform)
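To turn a folder of labeled clips into training data, we looped over the files and mean-pooled each clip's per-frame embeddings into a single vector. A sketch of that step, assuming a hypothetical dataset/<label>/<file>.wav layout and a load_waveform() helper that wraps the loading and resampling code above:

from pathlib import Path

X, y = [], []
for wav_path in Path('dataset').glob('*/*.wav'):
    waveform = load_waveform(wav_path)         # mono 16 kHz float32, as above
    _, embeddings, _ = yamnet(waveform)
    X.append(embeddings.numpy().mean(axis=0))  # mean-pool frames -> one 1024-dim vector
    y.append(wav_path.parent.name)             # the folder name is the label
X, y = np.array(X), np.array(y)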
2. Training the Custom Classifier
Once we had the embeddings, we trained a custom classifier using scikit-learn. This allowed us to fine-tune detection for only a few important classes, such as gunshot, glass_break, and dog_bark.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load your labeled embeddings
X = ...  # YAMNet embeddings (one mean-pooled 1024-dim vector per clip)
y = ...  # Corresponding labels ('gunshot', 'glass_break', 'dog_bark', ...)

# Train/test split, stratified so every class appears in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Train classifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
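Before deploying, it is worth checking the classifier on the held-out split and saving it for the real-time script. A minimal sketch (the emergency_clf.joblib filename is an arbitrary choice):

from sklearn.metrics import accuracy_score, classification_report
import joblib

# Evaluate on the held-out 20% split
y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Persist the trained classifier for the real-time detection script
joblib.dump(clf, 'emergency_clf.joblib')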
3. Real-Time Detection System
For real-time detection, we created a script that listens to the microphone, processes short audio chunks with YAMNet, and classifies them with the trained model. When it detects a critical sound, it raises an alert.
import sounddevice as sd

SAMPLE_RATE = 16000  # YAMNet expects 16 kHz mono audio

def callback(indata, frames, time, status):
    # Preprocess the live audio input (mono float32 chunk in indata[:, 0])
    # Extract YAMNet embeddings
    # Predict the class with the trained classifier
    # Raise an alert if a critical sound is detected
    pass

# Start the microphone stream and keep it open while the callback runs
with sd.InputStream(channels=1, samplerate=SAMPLE_RATE, callback=callback):
    sd.sleep(60_000)  # listen for one minute
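One practical caveat: sounddevice runs the callback on a latency-sensitive audio thread, so doing TensorFlow inference inside it risks dropped buffers. A common pattern, sketched here under that assumption with the yamnet model and clf classifier from the previous sections, is to have the callback only enqueue audio and run inference in the main loop:

import queue
import numpy as np
import sounddevice as sd

audio_q = queue.Queue()

def callback(indata, frames, time, status):
    # No heavy work here; just hand the chunk to the main loop
    audio_q.put(indata[:, 0].copy())

# blocksize=16000 delivers one-second chunks at 16 kHz
with sd.InputStream(channels=1, samplerate=16000,
                    blocksize=16000, callback=callback):
    while True:
        chunk = audio_q.get()
        _, embeddings, _ = yamnet(np.float32(chunk))
        # Mean-pool the frame embeddings and classify the chunk
        pred = clf.predict(embeddings.numpy().mean(axis=0, keepdims=True))[0]
        if pred in {'gunshot', 'glass_break', 'dog_bark'}:
            print(f'ALERT: {pred} detected!')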
This system can be used in a variety of safety-focused environments:
Smart home and business security systems
Healthcare and elderly-care monitoring
Public safety and smart-city surveillance
By combining YAMNet’s powerful feature extraction with a custom-trained classifier tailored for emergency audio events, we’ve built a scalable and accurate detection system. This solution can be integrated into smart security systems, healthcare monitoring tools, and public safety infrastructure to improve response times and reduce risks.
The modular design allows it to adapt to different scenarios with minimal changes. As we continue to improve accuracy and reduce false positives, this system can become a reliable layer of safety for both consumers and enterprises.